Abstract

We study the asymptotic behaviour of a Bayesian nonparametric test
of qualitative hypotheses. More precisely, we focus on the problem of
testing monotonicity of a regression function. Although some results are
known in the frequentist framework, no Bayesian testing procedure has
been proposed, or at least none has been studied theoretically. This paper
proposes a procedure that is straightforward to implement, which is a great
advantage compared to those proposed in the literature. We describe
theoretical properties of this procedure and illustrate its behaviour using
a simulation study and a real data analysis.
1 Introduction

Shape constrained models are of growing interest in the non parametric field.
Among them, monotonicity constraints are very popular. There is a wide
literature on the problem of estimating a function under monotonicity
constraints. Groeneboom (1985), Prakasa Rao (1970) and Robertson et al. (1988),
among others, study the non parametric maximum likelihood estimator of
monotone densities; Lo (1984), Brunner and Lo (1989), Khazaei et al. (2012) and
Salomond (2013) study the properties of a Bayesian estimator. Barlow et al.
(1972) and Mukerjee (1988) proposed shape constrained estimators of monotonic
regression functions. These methods are widely applied in practice. Bornkamp
and Ickstadt (2009) consider monotone functions when modeling the response to
a drug as a function of the dose, and Neittaanmaki et al. (2008) use a monotone
representation for environmental data.

In this paper, we propose a procedure to test for monotonicity constraints.
We consider the Gaussian regression model

\[ Y_i = f(i/n) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2), \quad \sigma^2 > 0, \quad i = 1, \dots, n, \tag{1} \]

and, with $\mathcal{F}$ being the set of all monotone functions, we test

\[ H_0 : f \in \mathcal{F} \quad \text{versus} \quad H_1 : f \notin \mathcal{F}. \tag{2} \]
Thus both the null and the alternative are non parametric hypotheses. The
problem of testing for monotonicity has already been addressed in the
frequentist literature and a variety of approaches have been considered.
Baraud et al. (2005) use projections of the regression function on the sets of
piecewise constant functions on a collection of partitions of the support of
$f$. Their test rejects monotonicity if there is at least one partition such
that the estimated projection is too far from the set of monotone functions.
Another approach, considered in Hall and Heckman (2000) and Ghosal et al.
(2000) among others, is to test for negativity of the derivative of the
regression function. However, this requires some assumptions on the regularity
of the regression function under the null hypothesis that could be avoided. In
a recent paper, Akakpo et al. (2012) propose a procedure that detects local
departures from monotonicity, and study its asymptotic properties very
precisely.

Here, we consider a Bayesian approach to this problem, which to the author's
knowledge has not been studied. We only consider the case where $\mathcal{F}$
is the set of monotone non increasing functions, but a similar approach could
be used when considering the set of monotone increasing or simply monotone
functions. The most common approach to testing in a Bayesian setting is the
Bayes Factor. Here however, we see that this method has drawbacks and seems to
perform poorly.



1.1 The Bayes Factor Approach

Since monotone non increasing densities are well approximated by piecewise
constants, see Groeneboom (1985) or Salomond (2013), it is natural to build a
prior on such functions. The standard approach to test for monotonicity of $f$
in a Bayesian setting would be to consider the Bayes Factor

\[ B_{0,1} = \frac{\pi(f \in \mathcal{F} \mid Y^n)}{1 - \pi(f \in \mathcal{F} \mid Y^n)} \times \frac{1 - \pi(\mathcal{F})}{\pi(\mathcal{F})}, \]

where $\pi$ has the form, for all $k \ge 2$ and all $f$ written as

\[ f = \sum_{i=1}^{k} \omega_i \mathbb{1}_{[(i-1)/k,\, i/k)}, \qquad d\pi(f) = \pi(k)\, \pi(\omega_1, \dots, \omega_k \mid k). \]

However, this approach seems to lead to poor results in practice. The reason
is that when $f$ has flat parts, it becomes difficult to detect monotonicity
due to estimation uncertainty. For instance, when considering the function
$f = 0$, the Bayes Factor does not seem to give a credible answer. As an
illustration, Figure 1 gives the histogram of the log Bayes Factor constructed
from 100 draws of data with $f = 0$ and $n = 100$. It appears that for these
runs, the Bayes Factor is rather small and that for a non negligible
proportion of samples the log Bayes Factor is negative. Thus the answers given
by the Bayes Factor are not satisfactory in this case.

Figure 1: 100 simulations of the log Bayes Factor $B_{0,1}$ for $f = 0$ and $n = 100$



1.2 An alternative approach

To tackle this issue of constructing a test robust to flat parts, we change the
formulation of our test into

\[ H_0^a : d(f, \mathcal{F}) \le \tau \quad \text{versus} \quad H_1^a : d(f, \mathcal{F}) > \tau, \tag{3} \]

where $d(f, \mathcal{F})$ is a distance between $f$ and the set of monotone non
increasing functions and $\tau$ a threshold. To perform such a test we consider
the $\gamma_0$-$\gamma_1$ loss with fixed $\gamma_0, \gamma_1 > 0$, and thus
our procedure can be defined as

\[ \delta_n^\pi := \begin{cases} 0 & \text{if } \pi(d(f, \mathcal{F}) \le \tau \mid Y^n) \ge \dfrac{\gamma_1}{\gamma_0 + \gamma_1}, \\ 1 & \text{otherwise.} \end{cases} \tag{4} \]

This idea is similar to the one proposed in Rousseau (2007) for the
approximation of a point null hypothesis by an interval hypothesis, see also
Verdinelli and Wasserman (1998). The threshold $\tau$ can be calibrated a
priori using prior knowledge on the tolerance to approximate monotonicity. In
practice such an a priori calibration is not always feasible. We therefore
propose in this paper an automatic calibration of $\tau$. The idea of this
construction is to choose $\tau$ small enough that the power of the test is not
too deteriorated, while remaining robust to flat parts under the null. The
resulting procedure has good asymptotic properties, see Theorem 1, but also
behaves well in finite sample situations, as shown in Section 3. Furthermore,
from a practical point of view, this procedure is easy to implement as it only
requires sampling from the posterior distribution. This is a great advantage
compared to the frequentist tests proposed in the literature, as they require
in general heavy computations. We calibrate $\tau$ such that our test is
consistent, that is for all $\rho > 0$ and $d(\cdot, \cdot)$ a metric

\[ \sup_{f \in \mathcal{F}} E_f^n(\delta_n^\pi) = o(1), \qquad \sup_{f,\, d(f, \mathcal{F}) > \rho} E_f^n(1 - \delta_n^\pi) = o(1), \tag{5} \]

where $d(f, \mathcal{F}) := \inf_{g \in \mathcal{F}} d(f, g)$. In the absence
of prior information on the threshold, it is natural to have $\tau$ depending
on $n$, since the more data, the more precise we can afford to be. Hence, to
better understand the effectiveness of the threshold induced by our approach,
we study the minimum separation rate of our test, which is the minimum value
$\rho = \rho_n$ such that (5) is still valid. A small $\rho_n$ implies that the
test is able to detect very small departures from the null. We want our
calibrated threshold to induce the smallest separation rate.



We thus propose a procedure which, although being a Bayesian answer to the
problem, is also asymptotically an answer to the problem (2). Moreover, our
procedure is automatic and easy to implement. The construction of the test is
presented in Section 2 and its asymptotic properties are discussed in Section
2.2. In Section 2.3 we propose a way to calibrate the hyperparameters of the
prior, rendering the procedure fully automatic. We then run our test on
simulated data in Section 3 and on real environmental data in Section 4. A
general discussion is provided in Section 5.



2 Construction of the test

In this section we present our testing procedure based on the approach
described in Section 1.2. We propose a choice for $d(f, \mathcal{F})$ which
measures the distance between the regression function $f$ and the set
$\mathcal{F}$. We also give an autocalibrated threshold $\tau$ such that by
answering the problem (3) we give a good answer to the problem (2). We then
propose a specific family of priors together with a choice for the
hyperparameters based on heuristics.



2.1 The testing procedure

As presented in Section 1, monotone non increasing functions are well
approximated by stepwise constant functions. Let $\mathcal{G}_k$ be the set of
piecewise constant functions with $k$ pieces on the partition
$\{[0, 1/k), \dots, [(k-1)/k, 1]\}$. We denote by $f_{\omega,k} \in \mathcal{G}_k$
the function

\[ f_{\omega,k}(\cdot) = \sum_{i=1}^{k} \omega_i \mathbb{1}_{[(i-1)/k,\, i/k)}(\cdot). \tag{6} \]

In model (1), we consider the residual variance $\sigma^2$ to be unknown. We
then build a prior on $(f, \sigma)$ by taking a prior on $k$ and building a
prior on each submodel $\mathcal{G}_k$. We define

\[ \pi(\omega, \sigma, k) := \pi(k)\, \pi(\sigma \mid k)\, \pi(\omega \mid \sigma, k). \]

First note that with this choice of prior we have, generally speaking,
$\pi(\mathcal{F}) > 0$. Furthermore, if the true regression function $f_0$ is
in $\mathcal{F}$ then the piecewise constant function of the form (6) which
minimizes the Kullback-Leibler divergence with $f_0$ will also be in
$\mathcal{F}$. We consider the following discrepancy measure $d(\cdot, \cdot)$

\[ d(f_{\omega,k}, \mathcal{F}) = H(\omega, k) = \max_{k \ge j \ge i \ge 1} (\omega_j - \omega_i). \tag{7} \]

From (7) it appears that $f_{\omega,k}$ is in $\mathcal{F}$ if and only if
$d(f_{\omega,k}, \mathcal{F}) = 0$. Here the discrepancy $d$ corresponds to the
sup norm between $f_{\omega,k}$ and the set of monotone non increasing
functions. The idea of the calibration is the following. In the model
$\mathcal{G}_k$, the a posteriori uncertainty for estimating
$\omega = (\omega_1, \dots, \omega_k)$ is of order $\sqrt{k/n}$. Hence any
monotone non increasing function $f_{\omega,k}$ such that for all $j > i$,
$\omega_i > \omega_j - O(\sqrt{k/n})$ might be detected as possibly monotone
non increasing. We thus decide to construct a threshold $\tau_n^k$ for each
model $\mathcal{G}_k$. We then compare $H(\omega, k)$ with some positive
threshold depending on $n$ and $k$, and calibrate $\tau_n^k$ such that our
procedure is consistent. Similarly to the frequentist procedures, we consider
Holderian alternatives

\[ f \in \mathcal{H}(\alpha, L) = \{ f : [0,1] \to \mathbb{R},\ \forall (x, y) \in [0,1]^2,\ |f(y) - f(x)| \le L |y - x|^\alpha \} \]

for some constant $L > 0$ and a regularity parameter $\alpha \in (0, 1]$. We
study the separation rate of our procedure and compare it with the minimax
separation rate $n^{-\alpha/(2\alpha+1)}$.
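
As an illustration, the discrepancy $H(\omega, k)$, read here as the maximum of $\omega_j - \omega_i$ over pairs $j \ge i$, can be evaluated in $O(k)$ operations with a running minimum, since the maximum over pairs equals $\max_j (\omega_j - \min_{i \le j} \omega_i)$. The following Python sketch is only an illustration; the function name `discrepancy` is ours and is not taken from the author's script.

```python
import numpy as np

def discrepancy(omega):
    """H(omega, k): maximum of omega[j] - omega[i] over pairs j >= i.

    The value is 0 exactly when omega is non increasing.
    """
    omega = np.asarray(omega, dtype=float)
    # running_min[j] = min(omega[0], ..., omega[j])
    running_min = np.minimum.accumulate(omega)
    return float(np.max(omega - running_min))
```

A non increasing vector such as `[3, 2, 2, 1]` gives a discrepancy of zero, while any upward step produces a positive value.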

2.2 Theoretical results

The following Theorem gives a calibration for $\tau_n^k$. It also gives an
upper bound for the minimal separation rate with respect to the distance
$d_n(\cdot, \cdot)$ defined as

\[ d_n(f, g)^2 = n^{-1} \sum_{i=1}^{n} (f(i/n) - g(i/n))^2. \]

We define a prior $\pi$ on $(f, \sigma)$ similarly to before by considering

\[ f_{\omega,k}(\cdot) = \sum_{i=1}^{k} \omega_i \mathbb{1}_{[(i-1)/k,\, i/k)}(\cdot) \]

and

\[ d\pi(\omega, \sigma, k) = \pi_k(k)\, \pi_\sigma(\sigma) \prod_{i=1}^{k} g(\omega_i), \]

where $g$ and $\pi_\sigma$ are density functions. We consider the following
conditions on the prior.

C1 The densities $g$ and $\pi_\sigma$ are continuous, $g(x) > 0$ for all
$x \in \mathbb{R}$ and $\pi_\sigma(\sigma) > 0$ for all $\sigma \in (0, \infty)$.

C2 $\pi_k$ is such that there exist positive constants $C_d$ and $C_u$ such that

\[ e^{-C_d k L(k)} \le \pi_k(k) \le e^{-C_u k L(k)}, \tag{8} \]

where $L(k)$ is either $\log(k)$ or 1.

The condition C1 is mild as it is satisfied for a large variety of
distributions. C2 is a usual condition when considering mixture models with a
random number of components (see e.g. Rousseau (2010)) and is satisfied by the
Poisson or Geometric distributions for instance. Under these conditions,
Theorem 1 gives us some insight on how to choose $\tau_n^k$.

Theorem 1 Under the assumptions C1 and C2, let $M_0 > 0$. Setting
$\tau = \tau_n^k = M_0 \sqrt{k \log(n)/n}$ and $\delta_n^\pi$ the testing procedure

\[ \delta_n^\pi = \mathbb{1}\{ \pi(H(\omega, k) > \tau_n^k \mid Y^n) > \gamma_0/(\gamma_0 + \gamma_1) \}, \]

there exists some $M > 0$ such that for all $\alpha \in (0, 1]$

\[ \sup_{f \in \mathcal{F}} E_f^n(\delta_n^\pi) = o(1), \qquad \sup_{f,\, d_n(f, \mathcal{F}) > \rho,\, f \in \mathcal{H}(\alpha, L)} E_f^n(1 - \delta_n^\pi) = o(1), \tag{9} \]

for all $\rho > \rho_n = M (n/\log(n))^{-\alpha/(2\alpha+1)}$.

Note that neither the prior nor the hyperparameters depend on the regularity
$\alpha$ of the regression function under the alternative. Moreover for all
$\alpha \in (0, 1]$, the separation rate $\rho_n(\alpha)$ is the minimax
separation rate up to a $\log(n)$ term. Thus our test is almost minimax
adaptive. The $\log(n)$ term seems to follow from our definition of
consistency, where we do not fix a level for the Type I or Type II error,
contrary to the frequentist procedures. The conditions on the prior are quite
loose, and are satisfied in a wide variety of cases. The constant $M_0$ does
not influence the asymptotic behaviour of our test but has a great influence in
practice for finite $n$. A way of choosing $M_0$ is given in Section 2.3.
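
To make the roles of $M_0$, $k$ and $n$ concrete, the threshold $\tau_n^k = M_0 \sqrt{k \log(n)/n}$, the rate $\rho_n(\alpha) = M (n/\log(n))^{-\alpha/(2\alpha+1)}$ and the decision rule of Theorem 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are ours, the inputs `H_draws` and `k_draws` are assumed to be posterior draws of $H(\omega, k)$ and $k$ produced by some sampler, and the posterior probability is replaced by its empirical frequency over those draws.

```python
import math
import numpy as np

def threshold(n, k, M0):
    """tau_n^k = M0 * sqrt(k * log(n) / n)."""
    return M0 * math.sqrt(k * math.log(n) / n)

def separation_rate(n, alpha, M=1.0):
    """rho_n(alpha) = M * (n / log(n)) ** (-alpha / (2 * alpha + 1))."""
    return M * (n / math.log(n)) ** (-alpha / (2.0 * alpha + 1.0))

def reject_monotonicity(H_draws, k_draws, n, M0, gamma0=0.5, gamma1=0.5):
    """Return 1 (reject) if the empirical posterior frequency of
    H(omega, k) > tau_n^k exceeds gamma0 / (gamma0 + gamma1), else 0."""
    H_draws = np.asarray(H_draws, dtype=float)
    # Each draw is compared with the threshold of its own model size k.
    taus = np.array([threshold(n, k, M0) for k in k_draws])
    post_prob = float(np.mean(H_draws > taus))
    return int(post_prob > gamma0 / (gamma0 + gamma1))
```

Both the threshold and the separation rate decrease as $n$ grows, which reflects the idea that more data allow a smaller tolerance.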

The proof of Theorem 1 is given in Appendix A; we now sketch the main ideas.
We approximate the true regression function $f_0$ in each submodel
$\mathcal{G}_k$ of piecewise constant functions with $k$ pieces on
$\{[0, 1/k), \dots, [1 - 1/k, 1)\}$ by $f_{\omega^0,k}$, the function that
minimizes the Kullback-Leibler divergence with $f_0$. We get a closed form
expression for $\omega^0 = (\omega_1^0, \dots, \omega_k^0)$ defined by

\[ \omega_i^0 = n_i^{-1} \sum_{j:\, j/n \in [(i-1)/k,\, i/k)} f_0(j/n), \qquad n_i = \mathrm{Card}\{ j,\ j/n \in [(i-1)/k,\, i/k) \}, \tag{10} \]

thus $f_{\omega^0,k}$ belongs to $\mathcal{F}$ for all $k$ when
$f_0 \in \mathcal{F}$. To prove the first part of (9), we bound
$H(\omega, k) \le 2 \max_i |\omega_i - \omega_i^0|$ if $f_0 \in \mathcal{F}$,
so that the threshold needs to be as large as the posterior concentration rate
of $\omega$ to $\omega^0$ in the misspecified model $\mathcal{G}_k$. Then, to
prove the second part of (9) when $\rho = \rho_n(\alpha)$, we bound
$H(\omega, k)$ from below by
$H(\omega^0, k) - 2 \max_i |\omega_i - \omega_i^0|$, which implies a constraint
on the separation rate of the test to ensure that, uniformly over
$d_n(f_0, \mathcal{F}) > \rho_n(\alpha)$ and $f_0 \in \mathcal{H}(\alpha, L)$,
we have $H(\omega, k) > \tau_n^k$.
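
The vector $\omega^0$ is simply the vector of bin averages of $f_0$ over the design points $j/n$, $j = 1, \dots, n$. A minimal Python sketch of this computation is given below; the function name `kl_projection` is ours, `f0` is assumed to be a vectorized function, and the point $x = 1$ is assigned to the last bin, which is a convention chosen here for the closed right end of the partition.

```python
import numpy as np

def kl_projection(f0, n, k):
    """Bin averages omega0[i] of f0(j/n), j = 1..n, on the k-piece partition."""
    j = np.arange(1, n + 1)
    # Integer arithmetic avoids rounding at bin edges; j = n goes to the last bin.
    bins = np.minimum((j * k) // n, k - 1)
    counts = np.bincount(bins, minlength=k)
    if np.any(counts == 0):
        raise ValueError("empty bin: k is too large for n")
    sums = np.bincount(bins, weights=f0(j / n), minlength=k)
    return sums / counts
```

For a non increasing `f0` the returned vector is itself non increasing, which is the property used in the first part of the proof.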
2.3 A choice for the prior in the non informative case

Conditions on the prior in Theorem 1 are satisfied for a wide variety of
distributions. However, when no further information is available, some
specific choices can ease the computations and lead to good results in
practice. We present in this section such a specific choice for $\pi$ and a
way to calibrate the hyperparameters. We also fix $\gamma_0 = \gamma_1 = 1/2$
in the definition of $\delta_n^\pi$.

A practical default choice is the usual conjugate prior, given $k$, i.e. a
Gaussian prior on $\omega$ with variance proportional to $\sigma^2$ and an
Inverse Gamma prior on $\sigma^2$. This considerably accelerates the
computations as sampling from the posterior is then straightforward. Condition
C2 on $\pi_k$ is satisfied by the two classical distributions on the number of
parameters in a mixture model, namely the Poisson distribution and the
Geometric distribution. It seems that choosing a Geometric distribution is more
appropriate as it is less spiked. We thus choose

\[ \pi := \begin{cases} k \sim \mathrm{Geom}(\lambda) \\ \sigma^2 \mid k \sim IG(a, b) \\ \omega_j \mid k, \sigma \sim \mathcal{N}(m, \sigma^2/\mu) \end{cases} \tag{11} \]

Standard algebra leads to a closed form for the posterior distribution up to a
normalizing constant. Denoting $n_j = \mathrm{Card}\{ i,\ i/n \in [(j-1)/k, j/k) \}$ and

\[ b_k = b + \frac{1}{2} \sum_{j=1}^{k} \Big( \sum_{i:\, i/n \in [(j-1)/k,\, j/k)} (Y_i - \bar{Y}_j)^2 + \frac{n_j \mu}{n_j + \mu} (\bar{Y}_j - m)^2 \Big), \]

where $\bar{Y}_j$ is the empirical mean of the $Y_l$ on the set
$\{ l,\ l/n \in [(j-1)/k, j/k) \}$, we have

\[ \pi_k(k \mid Y^n) \propto \pi(k)\, b_k^{-(a + n/2)}\, \mu^{k/2} \prod_{j=1}^{k} (n_j + \mu)^{-1/2}. \]

We can thus compute the posterior distribution of $k$ up to a constant. To
sample from $\pi_k$ we use a random walk Metropolis-Hastings algorithm. We
then compute the posterior distribution of $\omega$ and $\sigma$ given $k$:

\[ \sigma^2 \mid k, Y^n \sim IG(a + n/2,\, b_k), \qquad \omega_j \mid k, \sigma^2, Y^n \sim \mathcal{N}\Big( \frac{m \mu + n_j \bar{Y}_j}{n_j + \mu},\ \frac{\sigma^2}{n_j + \mu} \Big). \]
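
The conjugate updates can be sketched in Python as below. This is a hedged illustration and not the author's script: the function names are ours, the expression used for $b_k$ and the factor $\mu^{k/2}$ are obtained here by completing the square under the prior $\omega_j \mid k, \sigma \sim \mathcal{N}(m, \sigma^2/\mu)$, $\sigma^2 \sim IG(a, b)$, and the Geometric prior on $k$ is written up to a constant so that its exact support does not matter.

```python
import numpy as np

def bin_stats(y, k):
    """Counts n_j, means Ybar_j and within-bin sums of squares."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    idx = np.arange(1, n + 1)
    bins = np.minimum((idx * k) // n, k - 1)  # x = 1 goes to the last bin
    counts = np.bincount(bins, minlength=k)
    if np.any(counts == 0):
        raise ValueError("empty bin: k is too large for n")
    means = np.bincount(bins, weights=y, minlength=k) / counts
    ss = np.bincount(bins, weights=(y - means[bins]) ** 2, minlength=k)
    return counts, means, ss

def posterior_scale(counts, means, ss, b, m, mu):
    """b_k from completing the square (assumption: derived here)."""
    return b + 0.5 * np.sum(ss + counts * mu / (counts + mu) * (means - m) ** 2)

def log_post_k(y, k, lam, a, b, m, mu):
    """Unnormalized log posterior of k under a Geometric(lam) prior."""
    counts, means, ss = bin_stats(y, k)
    b_k = posterior_scale(counts, means, ss, b, m, mu)
    log_prior = k * np.log1p(-lam)  # Geometric prior, up to a constant
    return (log_prior + 0.5 * k * np.log(mu)
            - 0.5 * np.sum(np.log(counts + mu))
            - (a + len(y) / 2.0) * np.log(b_k))

def draw_given_k(y, k, a, b, m, mu, rng):
    """One draw of (omega, sigma2) from the posterior given k."""
    counts, means, ss = bin_stats(y, k)
    b_k = posterior_scale(counts, means, ss, b, m, mu)
    # If X ~ Gamma(shape, rate=b_k) then 1/X ~ IG(shape, b_k).
    sigma2 = 1.0 / rng.gamma(shape=a + len(y) / 2.0, scale=1.0 / b_k)
    post_mean = (m * mu + counts * means) / (counts + mu)
    omega = rng.normal(post_mean, np.sqrt(sigma2 / (counts + mu)))
    return omega, sigma2
```

A call such as `draw_given_k(y, 4, a=2.0, b=1.0, m=0.0, mu=0.1, rng=np.random.default_rng(0))` returns one vector of four step heights and one variance; the hyperparameter values in this call are placeholders, not the calibrated ones.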

Note that given $k$, sampling from the posterior is straightforward. We now
give a way to calibrate the hyperparameters $a, b, \mu, m, M_0$. We first
calibrate $M_0$, the constant in $\tau_n^k$. The functions in $\mathcal{F}$
that are the most difficult to detect as belonging to $\mathcal{F}$ are the
constant functions. We calibrate $M_0$ by choosing the smallest value such that
$E_f^n(\delta_n^\pi) \le \alpha$ for $f = 0$ and for reasonable sample sizes
$n$. Using the fact that the $\omega_j$ are a posteriori Gaussian and denoting
$z_n = 1 + k\mu/n$ we have, assuming that $\forall j$, $n_j = n/k$,

\[ \pi(H(\omega, k) > \tau_n^k \mid Y^n, k, \sigma) = \pi\Big( \max_{1 \le i < j \le k} \Big( \frac{\bar{Y}_j - \bar{Y}_i}{z_n} + \sqrt{\frac{\sigma^2 k}{n z_n}}\, (U_j - U_i) \Big) > \tau_n^k \,\Big|\, Y^n, k, \sigma \Big), \]

where $U_i \overset{iid}{\sim} \mathcal{N}(0, 1)$. This implies that

%(H{u,k) >T*\Y n ,k,a) < 1 - $ Unz n /2k 



l<i<j<k 



k Yj Yi 



l<i<j<k 



,k 



Given that $f = 0$ we have that $\bar{Y}_i \sim \mathcal{N}(0, k\sigma^2/n)$.
In this case we can easily prove that when $f = 0$,
$\pi(k = 2 \mid Y^n) = 1 + o_{P_0^n}(1)$ (using the same approach as in Lemma 3
given in Appendix A). Thus restricting our attention to $k = 2$ and $\sigma$
close to $\sigma_0$, we have



Yi - Y 



E](SZ) < Pf ( $ ( v /^72fc(^_^ - T *) ) > 1/2 

>M 




j^-Ki ^ ^ . /log(n)^ 



and thus choose 



\A>g(n) 



where $\hat{\sigma}$ is the posterior mean of $\sigma \mid k, Y^n$. We thus have
$E_{f=0}^n(\delta_n^\pi) \le \alpha$.

We now propose a calibration for $a$, $b$, $m$, $\mu$ and $\lambda$. We first
choose $m$ to be the empirical mean of the $Y_i$. We then choose $a$ and $b$
such that the prior on
a has a first order moment and E Tr (cr 2 ) is of the same order as <jy. We choose 
a = a 2 + 1 and b — a^. We want the prior on oj to be flat enough to recover 
large variations from the mean $m$. This is done by choosing the hyperparameter
$\mu$ small enough. We also want the prior on $k$ to be flat to allow large
values of $k$ even for small sample sizes. We calibrate $\mu$ and $\lambda$ on
simulated data when $f = 0$ and $\sigma = 1$, and choose the minimum values
such that $E_0^n(\delta_n^\pi) \le 0.05$.



3 Simulated Examples 

In this section we run our testing procedure on simulated data to study the
behaviour of our test for finite sample sizes. We choose the prior distribution
and calibrate the hyperparameters as described in Section 2.3. We consider nine
functions adapted from Baraud et al. (2003) and plotted in Figure 2:

\[ \begin{aligned}
f_1(x) &= -15 (x - 0.5)^3 \mathbb{1}_{x \le 1/2} - 0.3 (x - 0.5) + e^{-250 (x - 0.25)^2} \\
f_2(x) &= 0.15 x \\
f_3(x) &= 0.2 e^{-50 (x - 0.5)^2} \\
f_4(x) &= -0.5 \cos(6 \pi x) \\
f_5(x) &= -0.2 x + f_3(x) \\
f_6(x) &= -0.2 x + f_4(x) \\
f_7(x) &= -(1 + x) + 0.25 e^{-50 (x - 0.5)^2} \\
f_8(x) &= -0.5 x^2 \\
f_9(x) &= 0
\end{aligned} \tag{12} \]

Figure 2: Regression functions used in the simulated example.

The functions $f_1$ to $f_6$ are clearly not in $\mathcal{F}$. The function
$f_7$ has a small bump around $x = 0.5$ which can be seen as a local departure
from monotonicity. This function is thus expected to be difficult to detect for
small datasets given our parametrization. The function $f_9$ is a completely
flat function, which is the most difficult situation under $H_0$.
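
For readers who wish to reproduce this design, the nine regression functions and the data generating step of model (1) can be written in Python as follows. This is a sketch under stated assumptions: the formulas are transcribed from display (12) as read in this text (in particular the exponents in $f_1$, $f_3$ and $f_7$), and the helper name `simulate` is ours.

```python
import numpy as np

def f1(x):
    return (-15 * (x - 0.5) ** 3 * (x <= 0.5) - 0.3 * (x - 0.5)
            + np.exp(-250 * (x - 0.25) ** 2))

def f2(x):
    return 0.15 * x

def f3(x):
    return 0.2 * np.exp(-50 * (x - 0.5) ** 2)

def f4(x):
    return -0.5 * np.cos(6 * np.pi * x)

def f5(x):
    return -0.2 * x + f3(x)

def f6(x):
    return -0.2 * x + f4(x)

def f7(x):
    return -(1 + x) + 0.25 * np.exp(-50 * (x - 0.5) ** 2)

def f8(x):
    return -0.5 * x ** 2

def f9(x):
    return np.zeros_like(x, dtype=float)

def simulate(f, n, sigma2, rng):
    """Draw Y_i = f(i/n) + eps_i with eps_i ~ N(0, sigma2), i = 1..n."""
    x = np.arange(1, n + 1) / n
    return x, f(x) + rng.normal(0.0, np.sqrt(sigma2), size=n)
```

For instance `simulate(f7, 100, 0.01, np.random.default_rng(0))` returns the design points and one noisy sample from the function with the small local bump.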

For several values of $n$, we generate $N = 500$ replications of the data
$Y^n = \{ y_i,\ i = 1, \dots, n \}$ from model (1). For each replication we
draw $K = 5 \cdot 10^3$ iterations from the posterior distribution using a
Metropolis-Hastings sampler with a compound Geometric proposal. More
precisely, if $k_{i-1}$ is the state of our Markov chain at step $i - 1$, we
propose

\[ k_i' = k_{i-1} + p_i, \]

where $p_i$ is such that

\[ |p_i| \sim \mathrm{Geom}(0.3) + 1, \qquad P(p_i < 0) = P(p_i > 0) = \frac{1}{2}. \]

Given $k$ we draw directly $\sigma^2$ and $\omega$ from the conditional
posteriors given in Section 2.3. We then approximate
$\pi(H(\omega, k) > \tau_n^k \mid Y^n)$ by the standard Monte Carlo estimate

\[ \hat{\pi}(H(\omega, k) > \tau_n^k \mid Y^n) = \frac{1}{K} \sum_{i=1}^{K} \mathbb{1}\{ H(\omega^i, k^i) > \tau_n^{k^i} \} \]

and reject the null if $\hat{\pi}(H(\omega, k) > \tau_n^k \mid Y^n) > 1/2$.
The results are given in Table 1.

Table 1: Percentage of rejection for the simulated examples

                   Baraud et al.  Akakpo et al.  Bayes test, n =
f_0    sigma^2     n = 100        n = 100        100    250    500    1000   2500
f_1    0.01        99             99             100    100    100    100    100
f_2    0.01        99             100            98.4   100    100    100    100
f_3    0.01        99             98             98.8   100    100    100    100
f_4    0.01        100            99             100    100    100    100    100
f_5    0.004       99             99             99.5   100    100    100    100
f_6    0.006       98             99             100    100    100    100    100
f_7    0.01        76             68             18.6   41.8   66.6   86.4   100
f_8    0.01                                      2.4    1.8    1.6    1.4    3.6
f_9    0.01                                      5.6    5.1    5.2    5.1    4.2
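
The random walk on $k$ with a symmetric, sign-randomized Geometric step can be sketched as follows. This is an illustration under stated assumptions rather than the author's sampler: we read "Geom(0.3) + 1" as a step size supported on $\{1, 2, \dots\}$, which is what NumPy's `Generator.geometric` returns directly; the lower bound `k_min = 2` and the optional `k_max` are our choices; and `log_post` stands for any function returning the unnormalized log posterior of $k$.

```python
import numpy as np

def propose_k(k, rng, p=0.3):
    """Symmetric proposal k' = k + sign * step, step in {1, 2, ...}."""
    step = rng.geometric(p)  # NumPy's geometric law lives on {1, 2, ...}
    sign = 1 if rng.random() < 0.5 else -1
    return k + sign * step

def mh_chain_k(log_post, k0, n_iter, rng, k_min=2, k_max=None):
    """Random-walk Metropolis-Hastings chain on the number of pieces k."""
    chain = np.empty(n_iter, dtype=int)
    k, lp = k0, log_post(k0)
    for t in range(n_iter):
        k_new = propose_k(k, rng)
        inside = k_new >= k_min and (k_max is None or k_new <= k_max)
        if inside:
            lp_new = log_post(k_new)
            # Symmetric proposal: the acceptance ratio is the posterior ratio.
            # log1p(-u) with u in [0, 1) is the log of a uniform on (0, 1].
            if np.log1p(-rng.random()) < lp_new - lp:
                k, lp = k_new, lp_new
        chain[t] = k  # proposals outside the support are rejected
    return chain
```

Each visited value of $k$ would then be followed by a draw of $(\omega, \sigma^2)$ given $k$, and the indicator of $H(\omega, k) > \tau_n^k$ averaged over the chain gives the Monte Carlo estimate used in the test.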



For all the considered functions, the computational time is reasonable even
for large values of $n$. For instance, for $f_1$, we require less than 25
seconds to perform the test for $n = 2500$ using a simple Python script
available on the author's webpage. For the models with regression functions
$f_1$ to $f_7$, we choose the same residual variances as in Baraud et al.
(2003); for the last two functions, we choose a variance of 0.01, which is of
the same order. We observe that for the regression functions $f_1$ to $f_6$,
the test performs well and rejects monotonicity for almost all tested samples
even when $n$ is small. The results obtained for $n = 100$ are comparable with
those obtained in Akakpo et al. (2012) and Baraud et al. (2003). We observe a
substantial loss of power for $f_7$ for small sample sizes. For this last
function, the deviation from monotonicity is small and local, making it hard to
detect using piecewise constant functions. In fact, computing $H(\omega^0, k)$
for this function for different values of $k$ shows that we need $k > 13$ to
get $H(\omega^0, k) > 0$ and thus detect a departure from monotonicity. However,
with our choice of hyperparameters, the posterior distribution puts most of its
mass on small values of $k$ when $n$ is small.

It thus appears that our procedure requires a larger amount of data to detect
local departures from monotonicity than the frequentist ones. Nonetheless, our
test is easy to run on large datasets and performs well when $n > 1000$.



4 Application to Global Warming data 



We consider the Global Warming dataset provided by Jones et al. (2011), plotted
in Figure 3. It contains the annual temperature anomalies from 1850 to 2010,
expressed in degrees Celsius. A temperature anomaly is the departure from a
long-term average, here the 1961-1990 mean. The data are gathered from both
land and sea meteorological stations and corrected for non climatic errors. In
the literature, this dataset has been used to illustrate some isotonic
regression techniques in Wu et al. (2001) and Zhao and Woodroofe (2012), where
they use frequentist estimation procedures under monotonicity constraints.
Alvarez and Dey (2009) show, using a Bayesian monotonic change point method,
that there is a positive trend, and that this trend tends to increase by about
0.3°C in the global annual temperature between 1958 and 2000. Alvarez and Yohai
(2012) show that the phenomenon of global warming is due to a steady increasing
trend using isotonic estimation methods. In our model, that would mean that the
regression function $f$ should be positive, increasing and convex. In all these
papers the data are assumed to be a sequence of independent and identically
distributed random variables. This assumption is questionable (see Fomby and
Vogelsang (2002)), but considering annual temperature anomalies should reduce
the serial correlation. Like these authors, we make the same assumption of
independence. Our aim is to test whether the hypothesis of increasing
temperature anomaly is realistic, given the amount of information, using the
method described in Section 1. In particular, we choose the prior and the
hyperparameters based on the rule described in Section 2.

We perform our test on this dataset (more precisely on minus the temperature
anomalies, to test for a monotone increasing trend), choosing the
hyperparameters as in Section 2.3. We run the MCMC sampler described above for



K = 10° in order to compute Monte Carlo estimate of 6*. We obtained 

\[ \pi(H(\omega, k) > \tau_n^k \mid Y^n) = 0.98 \]

and thus the hypothesis of monotonicity is ruled out by our procedure. We
conclude that applying shape constrained regression techniques on the trend of
this dataset can deteriorate the estimation results.
Figure 3: Plot of the Global Warming data

5 Discussion

In this paper we propose a Bayesian approach to the problem of testing
qualitative hypotheses in a non parametric framework. More precisely we address
the problem of testing monotonicity of a regression function. This problem
arises naturally as shape constrained models, and monotonicity in particular,
are widely used in practice. Our approach is particularly interesting as it
focuses on a problem where the Bayes Factor seems to give poor results and thus
an alternative approach should be considered.

The testing procedure proposed in this paper is a modified version of the Bayes
Factor that only rejects $H_0$ when the data give strong evidence that the
function is not monotone. When possible, one can choose a threshold based on
prior information on the tolerance level to non monotonicity. However, this
could be difficult in practice; we thus present a way to calibrate our test
such that it behaves well asymptotically. Interestingly this calibration leads
to the optimal separation rate (up to a $\log(n)$ term) and thus the tolerance
induced by our approach, and the fact that we test (3) ($H_0^a$ versus $H_1^a$)
instead of (2) ($H_0$ versus $H_1$), is of the same order as the classical tests
available in the literature. It has the advantage of being very simple to
implement even in the presence of large datasets. Although we have focused on
monotonicity constraints, other types of shape constraints such as convexity or
unimodality can be dealt with using this approach. For instance we can test for
convexity using piecewise linear functions as submodels $\mathcal{G}_k$ and
test monotonicity of the slope.
A Proof of Theorem 1

In order to prove Theorem 1 we need some concentration results of the posterior
around the true regression function. The following Lemma provides a posterior
concentration rate when $f_0$ is either in $\mathcal{F}$ or in
$\mathcal{H}(\alpha, L)$. The proof is given in Appendix B and is derived from
Ghosal and van der Vaart (2007). Some adaptive results are known for the
Gaussian regression under some regularity assumptions, but the monotone case
has not been studied and thus this Lemma is of interest in its own right.
Let $d_n(\cdot, \cdot)$ be defined as

\[ d_n(f, g)^2 = n^{-1} \sum_{i=1}^{n} (f(i/n) - g(i/n))^2 \]

and denote by $P_0^n$ the distribution of the $Y_i$ when $f = f_0$ in (1).
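
The empirical distance $d_n$ is the root mean squared difference between two functions on the design points $i/n$, $i = 1, \dots, n$. A one-function Python sketch (the name `d_n` and the assumption that `f` and `g` are vectorized are ours):

```python
import numpy as np

def d_n(f, g, n):
    """Empirical L2 distance between f and g on the design i/n, i = 1..n."""
    x = np.arange(1, n + 1) / n
    return float(np.sqrt(np.mean((f(x) - g(x)) ** 2)))
```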

Lemma 1 Let $f_0$ be either in $\mathcal{F}$ or in $\mathcal{H}(\alpha, L)$,
and let $\pi$ be defined as in Theorem 1. Then

\[ E_0^n\big( \pi( d_n(f_{\omega,k}, f_0)^2 + (\sigma - \sigma_0)^2 \ge M \epsilon_n^2 \mid Y^n ) \big) \to 0, \]

where $\epsilon_n = (n/\log(n))^{-1/3}$ if $f_0 \in \mathcal{F}$ and
$\epsilon_n = (n/\log(n))^{-\alpha/(2\alpha+1)}$ if
$f_0 \in \mathcal{H}(\alpha, L)$.

The proof of this lemma is postponed to Appendix B. Given this result, we get
the following Lemma that enables us to derive consistency and an upper bound
on the separation rate.
Lemma 2 Let $M$ be a positive constant and
$\rho_n(\alpha) = M (n/\log(n))^{-\alpha/(2\alpha+1)}$. Let $\pi$ be as in
Theorem 1 and $\omega^0$ be the minimizer of the Kullback-Leibler divergence
$KL(f_{\omega,k}, f_0)$. Then

\[ \pi\big( \max_i |\omega_i - \omega_i^0| > \zeta_n^k \mid Y^n \big) \le 1/2 + o_{P_0^n}(1), \tag{13} \]

where $\zeta_n^k$ is either equal to $\tau_n^k$ when $f_0 \in \mathcal{F}$ or
$\rho_n(\alpha)$ when $f_0 \notin \mathcal{F}$ and
$f_0 \in \mathcal{H}(\alpha, L)$.

The proof of this lemma is postponed to Appendix B. Given the preceding
results, we derive (9).

Consistency under $H_0$. Let $f_0 \in \mathcal{F}$, then

\[ H(\omega, k) \le 2 \max_i |\omega_i - \omega_i^0| \]

and thus

\[ \pi(H(\omega, k) > \tau_n^k \mid Y^n) \le 1/2 + o_{P_0^n}(1), \]

which gives the consistency under $H_0$ given Lemma 2.

Consistency under $H_1$ and upper bound for the separation rate. Let
$f_0 \notin \mathcal{F}$ and $f_0 \in \mathcal{H}(\alpha, L)$; we have

\[ H(\omega, k) \ge H(\omega^0, k) - 2 \max_i |\omega_i - \omega_i^0|. \tag{14} \]

Assume that $\rho_n(\alpha) < d_n(f_0, \mathcal{F})$; we derive a lower bound
for $H(\omega^0, k)$. Let $g^*$ be the monotone non increasing piecewise
constant function on the partition $\{[0, 1/k), \dots, [(k-1)/k, 1)\}$, with
for $1 \le i \le k$, $g_i^* = \min_{j \le i} \omega_j^0$. Given that
$d_n(f_{\omega^0,k}, \mathcal{F}) = \inf_{g \in \mathcal{F}} d_n(f_{\omega^0,k}, g)$
we get

\[ d_n(f_{\omega^0,k}, \mathcal{F}) \le d_n(f_{\omega^0,k}, g^*) \le H(\omega^0, k). \]

And therefore, given that
$d_n(f_0, \mathcal{F}) \le d_n(f_{\omega^0,k}, \mathcal{F}) + d_n(f_{\omega^0,k}, f_0)$,

\[ \pi(H(\omega, k) < \tau_n^k \mid Y^n) \le \pi\Big( \max_i |\omega_i - \omega_i^0| > \frac{d_n(f_0, \mathcal{F}) - d_n(f_{\omega^0,k}, f_0) - \tau_n^k}{2} \,\Big|\, Y^n \Big). \]

The following Lemma states that the posterior probability that $k$ is greater
than $K (n/\log(n))^{1/(2\alpha+1)}$ is $o_{P_0^n}(1)$.

Lemma 3 Let $\mathcal{K}_n = \{ k \le K (n/\log(n))^{1/(2\alpha+1)} \}$. If
$\pi$ is defined as in Theorem 1 and $f_0 \in \mathcal{H}(\alpha, L)$, then

\[ \pi(\mathcal{K}_n^c \mid Y^n) \le o_{P_0^n}(1). \tag{15} \]

The proof is postponed to Appendix B.

For $k \in \mathcal{K}_n$ and $M$ large enough we have
$\rho_n(\alpha)/4 > \tau_n^k$. Denoting
$B_n = \{ d_n(f_{\omega,k}, f_0)^2 + |\sigma_0 - \sigma|^2 \le \epsilon_n^2 \}$,
Lemma 1 gives

\[ \pi(B_n^c \mid Y^n) = o_{P_0^n}(1). \]

On the set $B_n \cap \mathcal{K}_n$ we have, for $M$ large enough,
$\rho_n(\alpha)/4 > d_n(f_{\omega^0,k}, f_0)$, and thus

\[ \pi(H(\omega, k) < \tau_n^k \mid Y^n) \le \pi\big( \{ \max_i |\omega_i - \omega_i^0| > \rho_n(\alpha)/8 \} \cap \{ \mathcal{K}_n \cap B_n \} \mid Y^n \big) + o_{P_0^n}(1). \]

Given (13), we get that for all $f_0$ such that
$d_n(f_0, \mathcal{F}) > \rho_n(\alpha)$

\[ \pi(H(\omega, k) < \tau_n^k \mid Y^n) \le \frac{1}{2} + o_{P_0^n}(1), \]

which ends the proof.



B Proof of Lemmas 1, 2 and 3
B.1 Proof of Lemma 1

In this section we prove that the posterior concentrates around
$(f_0, \sigma_0)$ at the rate $(n/\log(n))^{-1/3}$ if $f_0 \in \mathcal{F}$ and
$(n/\log(n))^{-\alpha/(2\alpha+1)}$ if $f_0 \in \mathcal{H}(\alpha, L)$. To do
so we follow the approach of Ghosal and van der Vaart (2007). Let
$KL(f, g) = \int f \log(f/g)$ be the Kullback-Leibler divergence between the
two probability densities $f$ and $g$. We define
$V(f, g) = \int (\log(f/g) - KL(f, g))^2 f$. We denote by
$P_i(\omega, \sigma, k)$ the probability measure of
$Y_i = f_{\omega,k}(i/n) + \epsilon_i$ and by $p_i(\omega, \sigma, k)$ its
density with respect to the Lebesgue measure when
$\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, and by $P_{i,0}$ the true
distribution of $Y_i$ and $p_{i,0}$ its density. We only consider the case where
$f_0 \in \mathcal{F}$; a similar proof holds when
$f_0 \in \mathcal{H}(\alpha, L)$. We define

\[ B_n(\epsilon) = \Big\{ \sum_{i=1}^{n} KL(p_i(\omega, \sigma, k), p_{i,0}) \le n\epsilon^2,\ \sum_{i=1}^{n} V(p_i(\omega, \sigma, k), p_{i,0}) \le n\epsilon^2 \Big\}. \]

Here $p_i(\omega, \sigma, k)$ and $p_{i,0}$ are Gaussian distributions, so we
can easily compute



#£(Pi(w,cr,fc),p i)0 ) = -log 



2 V a 2 / 2 a 2 



7( Pi (cJ ! (7,fe),pi,0)=- 1--| + 



{fu,k( X i) ~ fo(Xi)) 



We have B n (e„) D /„) < Ce n , |a 2 - CT 2 | 2 < Ce 2 }. 

For /o G J 7 , denoting w° = nj 1 Y^x^i, fo(xi) and = mf (^)> ^7 = su p(-A) 
we have 

dn(fw,k, fa) = d n (fo, f u °,k) + d n (f u ,k, fu°,k) 



1G 



and 



k 



< 



3 = 1 Xi£lj 
1 k 




fix])? 



< 



c\\fo\\l 



Denoting k n = C\{n/ log(rt)) 1 / 3 ] we deduce that B n (e n ) D {k = k n 



, ,0112 < ,2 

We deduce that 



< ei, \a' A - crgj < e^} w here 



is the standard Euclidean norm in 



n(B n (en)) > C inf (7r w (/ (x)))e„ 7r (T (ag) e ^(fc = fc„) > e - Co " e " (16) 
\xe[o,i] J 

To end the proof of Lemma 1, the standard approach of Ghosal and van der Vaart
(2007) requires the existence of an exponentially consistent sequence of
tests. Their Theorem 4, suited for independent observations, relies on the fact
that the set
$\{ d_n(f_{\omega,k}, f_0)^2 + (\sigma - \sigma_0)^2 \ge \epsilon_n^2 \}$ can be
covered with Hellinger balls. Because of the unknown variance, this cannot be
done here; we thus use an alternative approach to construct tests, and then
apply Theorem 3 from Ghosal and van der Vaart (2007).



, - {XX r, (J'e») 2 < d n (U.k, fo) 2 + {?- oo) 2 < (U + 1)« 
There exists a constant C > such that 



Consider the sets J- k 



T k C 

3 



\U> — CO 



< 2je n ,\a-a°\ < 2je n } 



(17) 



To apply Theorem 1 of Ghosal and van der Vaart ( 2007[ ), we construct test 
following Choi and Schervish ( 2007 1 . 

For |cr — o"o| < oo/2. Simple algebra leads to an equivalence between 

1/2 

(d n ( f, f') 2 + ip — o') 2 ) and the Hellinger metric so that we can apply Lemma 
2 of |Ghosal and van der Vaart ( |2007| ). Equation (17 1 implies that for all £ > 
there exist a £e n net of T k containing less than (Dj/^) k point with D > 0. We 
then have a test ^5>i such that 



E?(*i) < e" 



3 ne n . 



sup 



^n{\a--a a \<cr /2} 

For a > 3oq/2 we consider the test ^ defined as 



E/, CT (l-*i) <e 



3 ne n 



E 

. i=l 



Yi - f (xi) 



> nc\ 
for a suitably chosen constant $c_1 > 0$. The Chernoff bound gives

E£(* 2 ) < e- Cn 

for some constant C > 0. Note that if a > 3<r /2 and (/, a) € Tj, thus j > joAn 
for some j > 0. If Y, t = f(x t ) + as, where s t ~ 7V(0, 1) then £? =1 ( ^"f" ^ ) 2 
follow a non central x 2 distribution with non centrality parameter Y^i=i(f( x i) ~ 
fo{xi)) 2 /a 2 > 0. Thus setting W ~ xl 

E/ , CT (1 - * 2 ) = P f , a 2 < « c i) < iV < cn^ 

Chernoff bound gives 

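For $\sigma < \sigma_0/2$ we consider the test $\Psi_3^l$ associated to $f^l \in \mathcal{F}_j^k$, a point in the $j\xi\epsilon_n$-net,
and some suitably chosen $0 < c_2 < 1$, defined as

\[ \Psi_3^l = 1\Big\{ \sum_{i=1}^n \Big( \frac{Y_i - f^l(x_i)}{\sigma_0} \Big)^2 \le n c_2 \Big\}. \]

Similarly to before, under $P_{f_0,\sigma_0}$, $\sum_{i=1}^n ((Y_i - f^l(x_i))/\sigma_0)^2$ follows a non-central
$\chi^2$ distribution.

Given that the moment generating function of a non-central $\chi_n^2$ distribution
with non-centrality parameter $\lambda$ at point $s < 1/2$ is known to be $(1-2s)^{-n/2}\exp\{s\lambda/(1-2s)\}$,
we have for all $(f, \sigma) \in \mathcal{F}_j^k \cap \{\sigma < \sigma_0/2\}$ such that $d_n(f^l, f) \le \xi\epsilon_n$

\[ E_{f,\sigma}(1 - \Psi_3^l) \le \exp\Big\{ \frac{n}{2}\Big( -\log(1-2s) + \frac{2s}{\sigma^2(1-2s)} d_n(f^l, f)^2 - 2sc_2\frac{\sigma_0^2}{\sigma^2} \Big) \Big\}. \]

For $s$ small enough, the term in $d_n(f^l, f)^2 \le \xi^2\epsilon_n^2$ is dominated by the term in $c_2$,
which in turn gives

\[ E_{f,\sigma}(1 - \Psi_3^l) \le e^{-nc^*}. \]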
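The exponential bound on the type II error rests on the moment generating function of a non-central $\chi^2$ variable. As a sanity check, the sketch below compares the standard closed form $(1-2s)^{-n/2}\exp\{s\lambda/(1-2s)\}$, with $\lambda$ the sum of squared means, against the Poisson-mixture series representation, and then forms the resulting Markov bound. The function names and numerical values are illustrative assumptions only.

```python
import math


def ncx2_mgf_closed_form(n, lam, s):
    """(1 - 2s)^(-n/2) * exp(s * lam / (1 - 2s)), valid for s < 1/2."""
    return (1.0 - 2.0 * s) ** (-n / 2.0) * math.exp(s * lam / (1.0 - 2.0 * s))


def ncx2_mgf_series(n, lam, s, terms=200):
    """Same quantity from the Poisson-mixture representation:
    a non-central chi^2_n(lam) is a central chi^2_{n + 2J}, J ~ Poisson(lam/2)."""
    total = 0.0
    weight = math.exp(-lam / 2.0)  # Poisson probability of J = 0
    for j in range(terms):
        total += weight * (1.0 - 2.0 * s) ** (-(n + 2 * j) / 2.0)
        weight *= (lam / 2.0) / (j + 1)
    return total


def chernoff_upper(n, lam, t, s):
    """Markov bound exp(-s t) E[exp(s X)] on P(X > t), X ~ chi^2_n(lam), 0 < s < 1/2."""
    return math.exp(-s * t) * ncx2_mgf_closed_form(n, lam, s)


closed = ncx2_mgf_closed_form(10, 3.0, 0.1)
series = ncx2_mgf_series(10, 3.0, 0.1)
print(closed, series, chernoff_upper(10, 3.0, 100.0, 0.2))
```

Taking logarithms of `chernoff_upper` with $t$ proportional to $n$ and $\lambda$ proportional to $n\,d_n(f^l, f)^2$ yields an exponent of the same shape as the display above.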
Taking $\Psi_3 = \max_l \Psi_3^l$ we get a test such that

\[ E_0^n(\Psi_3) = o(1); \qquad \sup_{\mathcal{F}_j^k \cap \{\sigma < \sigma_0/2\}} E_{f,\sigma}(1 - \Psi_3) \le e^{-j^2 n\epsilon_n^2}. \]

We conclude the proof by taking $\phi_n = \max\{\Psi_1, \Psi_2, \Psi_3\}$ as an exponentially
consistent sequence of tests and applying Theorem 3 of Ghosal and van der
Vaart (2007).



B.2 Proof of Lemma 2

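Denoting $A_n = \{d_n(f_{\omega,k}, f_0)^2 + |\sigma - \sigma_0|^2 \le \epsilon_n^2\}$ for $\epsilon_n$ as in Lemma 1, we have

\[ \pi\big(\max_i |\omega_i - \omega_i^0| > \xi_n \mid Y^n\big) \le \sum_{k=1} \pi(k \mid Y^n)\, \pi\big(\{\max_i |\omega_i - \omega_i^0| > \xi_n\} \cap A_n \mid Y^n, k\big) + o_{P_0}(1). \qquad (18) \]

Let

\[ \pi\big(\{\max_i |\omega_i - \omega_i^0| > \xi_n\} \cap A_n \mid Y^n, k\big) = \frac{\int_{\{\max_i |\omega_i - \omega_i^0| > \xi_n\} \cap A_n} \frac{p_{(\omega,\sigma,k)}}{p_{(\omega^0,\sigma_0,k)}}(Y^n)\, d\pi(\omega, \sigma)}{\int \frac{p_{(\omega,\sigma,k)}}{p_{(\omega^0,\sigma_0,k)}}(Y^n)\, d\pi(\omega, \sigma)} = \frac{N_n^k}{D_n^k}. \]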

For a fixed k, let 



Pi{uj,a, k) 



n 

i=l 



1 / 

Note that 



Pi(cj,a,k) \ ( ( pi(u,a,k) 



KL*(U, k , f u o tk ) = \ log(a 2 / ( x 2 ) - hi + ^(n- 1 £ f( Xl ) 2 - 



2 

k 



'0 



1 



2 , ,2 



/c<7 2 ' — ' \ (T 
t=l 
Denoting = {KL*(U, k , f u o <k ) < (^) 2 , V*(f u , k , / w o, fc ) < (^) 4 )} we

have that B*(e„) D {d n (/ u , fc , / Wo , fe ) 2 < C(^) 4 , |<r - a | 2 < (f*) 4 }, and deduce 



(19) 



Following the proof of Lemma 10 of Ghosal and van der Vaart (2007) we get
that

P£{D k n > e-V+^ 2 ) < ^ (20) 

We now derive an upper bound for $N_n^k$, for a fixed $k$, with high probability.
Following Kleijn and van der Vaart (2006) we construct a test $\phi_n$ between the
measures $P_0$ and $Q_{\omega,\sigma,k}$, defined as $dQ_{\omega,\sigma,k} = (p_0/p_{(\omega^0,\sigma_0,k)})\, dP_{\omega,\sigma,k}$, such that



sup 

(u;,CT),max \ujj — uj® |>7 



(21) 



The construction of such tests is commonly used to derive convergence rates of
the posterior distribution, and general results are known when $P$ and $Q$ are
probability measures (see Le Cam 1986, Chapter 16.4). In the case of Gaussian
regression, we compute



dQ u ,„,k{Y n ) =e 



io 2 - 2 «(f|-i)(n- i E: =1 /o^) 2 -Ej =1 ^; 



nn 



%/2ttct 2 



(22) 



Note that for $(\omega, \sigma, k) \in A_n$ the first term in (22) is bounded by $e^{An\epsilon_n^2}$, where $A$
is a positive constant that may depend on $f_0$. We thus fall into the usual setting
of testing between two probability measures, and use the same approach
as in Ghosal and van der Vaart (2007) to construct a test that satisfies (21). We
first get an upper bound for the metric entropy



N(&) = N a) G A n , max |^ - W °| < £*} , &») 



Note that 



f (w.ff) G A„,max | Wj - C {^(/,,,/w ) < & k - ^ol < e n } (23) 

We can thus derive the upper bound $N(\xi_n^k) \lesssim (\xi_n)^{-k}$. We then build a test
$\phi_n$ satisfying (21) in the same way as in Ghosal and van der Vaart (2007), and
compute
( N k

pn / 1 'to 

\m 



Ia„ > 1/2 < 2E 



n l x 1 n 



N k 



D k 

k 



< 2E 



2C 



< e- c "«») E£ (N k (l - <£„)) + — + 2e- c ^"> 



o 



: \ui—u?\>c 



2C 



C" 



•,fe(l - 0„)c?7r(w, cr) + 



(24) 



If $f_0 \in \mathcal{F}$, we have $n\epsilon_n^2 \asymp \log(n)$ and thus the first term of (24) is less than
$e^{-Ck\log(n)}$. We thus have



f N k \ 1 

- " 1/2 ) ~ n 



Given (18) we deduce (13) for $f_0 \in \mathcal{F}$. If $f_0 \notin \mathcal{F}$ and $f_0 \in H_\alpha(L)$ we have that,
for $M$ large enough, $An\epsilon_n^2 \le C\rho_n(\alpha)/2$. The first term in (24) is then less than
$e^{-C\rho_n(\alpha)/2}$, and thus

1 



, N k 

D k ' 



> 1/2 



< 



which gives (13).



B.3 Proof of Lemma 3

Let $f_0$ be in $H_\alpha(L)$ and $k_n = (n/\log(n))^{1/(2\alpha+1)}$. Similarly to before, we have
$\pi(B_n(\epsilon_n)) \ge e^{-Cn\epsilon_n^2}$. We define $N_n$ and $D_n$ such that



E k <k)f*-^(Yn)dir(u,a) D n 



Given Lemma 10 of Ghosal and van der Vaart (2007), we have

\[ P_0^n\big( D_n \le e^{-Cn\epsilon_n^2} \big) = o(1). \]



Note also that 



K(N n ) = V 7r(fc) / / P( "'' 7 ' fc) (y")p (y n )d7r(^,(7)dy" = vr(fc < fc„) < C e~ c ^ L ^ 



Thus for C small enough we have 
E£ [7r(fce/Q|Y")]=ES



Nn- 



< e Cn <ce 
<o(l) 



C u k n L(k n ) 



+ 0(1) 
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